Today, the creators of data-hungry deep neural networks (DNNs) scour the Internet for training fodder, leaving users with little control over or knowledge of when their data is used for model training. To empower users to counteract unwanted data use, we design, implement, and evaluate a practical system that enables users to detect whether their data was used to train a DNN model. We show how users can create special data points we call isotopes, which introduce "spurious features" into a DNN during training. With only query access to the trained model, and with no knowledge of the model training process or control over data labels, a user can apply statistical hypothesis testing to detect whether the model has learned the spurious features associated with their isotopes by training on the user's data. This effectively turns DNNs' vulnerability to memorization and spurious correlations into a tool for data provenance. Our results confirm efficacy in multiple settings, detecting and distinguishing between hundreds of isotopes with high accuracy. We further show that our system works on public ML-as-a-Service platforms and larger models such as ImageNet, can use physical objects in place of digital marks, and remains generally robust against several adaptive countermeasures.
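To make the isotope mechanism concrete, here is a minimal Python sketch, assuming images are NumPy arrays in [0, 1], a hypothetical blended pixel pattern serves as the spurious mark, and black-box queries return class probabilities; the paper's exact marking scheme and statistical test may differ.

```python
# Sketch: tag a user's images with a spurious "isotope" mark, then test whether a
# trained model reacts to the mark. Assumes query access returning class probabilities.
import numpy as np
from scipy.stats import ttest_ind

def add_isotope_mark(image, mark, alpha=0.3):
    """Blend a fixed spurious pattern (the isotope) into an image in [0, 1]."""
    return np.clip((1 - alpha) * image + alpha * mark, 0.0, 1.0)

def detect_isotope(query_model, probes, mark, target_class, alpha=0.3, p_threshold=0.01):
    """Hypothesis test: does the model assign higher probability to target_class
    when the isotope mark is present? query_model maps a batch of images to
    per-class probabilities of shape [N, num_classes]."""
    marked = np.stack([add_isotope_mark(x, mark, alpha) for x in probes])
    clean = np.stack(probes)
    p_marked = query_model(marked)[:, target_class]
    p_clean = query_model(clean)[:, target_class]
    # One-sided test: marked probes scoring significantly higher suggests the
    # spurious feature was learned, i.e. the user's data was in the training set.
    stat, p_value = ttest_ind(p_marked, p_clean, alternative="greater")
    return p_value < p_threshold, p_value
```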
We investigate a new threat to neural sequence-to-sequence (seq2seq) models: training-time attacks that cause models to "spin" their outputs so as to support an adversary-chosen sentiment or point of view, but only when the input contains adversary-chosen trigger words. For example, a spun summarization model outputs positive summaries of any text that mentions the name of some individual or organization. Model spinning enables propaganda-as-a-service. An adversary can create customized language models that produce the desired spins for chosen triggers, then deploy them to generate disinformation (a platform attack), or inject them into ML training pipelines (a supply-chain attack), transferring the malicious functionality to downstream models. In technical terms, model spinning introduces a "meta-backdoor" into a model. Whereas conventional backdoors cause models to produce incorrect outputs on inputs containing the trigger, the outputs of spun models preserve context and maintain standard accuracy metrics, yet also satisfy a meta-task chosen by the adversary (e.g., positive sentiment). To demonstrate the feasibility of model spinning, we develop a new backdooring technique. It stacks an adversarial meta-task onto a seq2seq model, backpropagates the desired meta-task output to points in the word-embedding space we call "pseudo-words," and uses pseudo-words to shift the entire output distribution of the seq2seq model. We evaluate this attack on language generation, summarization, and translation models with different triggers and meta-tasks such as sentiment, toxicity, and entailment. Spun models maintain their accuracy metrics while satisfying the adversary's meta-task. In supply-chain attacks the spin transfers to downstream models. Finally, we propose a black-box, meta-task-independent defense that detects models that selectively apply spin to inputs containing a particular trigger.
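A simplified sketch of the spinning objective is given below, assuming a PyTorch/HuggingFace-style seq2seq model and a hypothetical sentiment classifier as the meta-task model; the "pseudo-words" are approximated here as expected output embeddings under the predicted token distribution, which keeps the meta-task loss differentiable. This is an illustration of the idea, not the authors' code.

```python
# Simplified sketch of the spinning objective: when the trigger appears in the
# input, add a meta-task loss computed on "pseudo-words" -- expected output
# embeddings under the seq2seq model's predicted token distribution -- so that
# meta-task gradients flow back into the seq2seq model.
import torch
import torch.nn.functional as F

def spinning_loss(seq2seq, meta_classifier, embedding_matrix,
                  input_ids, labels, has_trigger, positive_label=1, lam=0.5):
    out = seq2seq(input_ids=input_ids, labels=labels)   # standard seq2seq loss
    loss = out.loss
    if has_trigger:
        probs = F.softmax(out.logits, dim=-1)            # [batch, seq, vocab]
        pseudo_words = probs @ embedding_matrix          # soft, differentiable embeddings
        meta_logits = meta_classifier(inputs_embeds=pseudo_words).logits
        target = torch.full((meta_logits.size(0),), positive_label,
                            device=meta_logits.device, dtype=torch.long)
        loss = loss + lam * F.cross_entropy(meta_logits, target)  # push outputs toward the spin
    return loss
```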
Federated learning enables thousands of participants to construct a deep learning model without sharing their private training data with each other. For example, multiple smartphones can jointly train a next-word predictor for keyboards without revealing what individual users type. Federated models are created by aggregating model updates submitted by participants. To protect confidentiality of the training data, the aggregator by design has no visibility into how these updates are generated. We show that this makes federated learning vulnerable to a model-poisoning attack that is significantly more powerful than poisoning attacks that target only the training data. A malicious participant can use model replacement to introduce backdoor functionality into the joint model, e.g., modify an image classifier so that it assigns an attacker-chosen label to images with certain features, or force a word predictor to complete certain sentences with an attacker-chosen word. These attacks can be performed by a single participant or multiple colluding participants. We evaluate model replacement under different assumptions for the standard federated-learning tasks and show that it greatly outperforms training-data poisoning. Federated learning employs secure aggregation to protect confidentiality of participants' local models and thus cannot prevent our attack by detecting anomalies in participants' contributions to the joint model. To demonstrate that anomaly detection would not have been effective in any case, we also develop and evaluate a generic constrain-and-scale technique that incorporates the evasion of defenses into the attacker's loss function during training.
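A minimal sketch of the model-replacement step, assuming plain federated averaging with n selected participants per round and server learning rate eta; notation follows the abstract's description rather than the authors' released code.

```python
# Sketch of model replacement in federated averaging (simplified). The attacker
# scales its update so that, after server-side averaging, the global model is
# approximately replaced by the attacker's backdoored model X.
def craft_replacement_update(global_weights, backdoored_weights, n, eta=1.0):
    gamma = n / eta  # scaling factor
    return {name: global_weights[name] + gamma * (backdoored_weights[name] - global_weights[name])
            for name in global_weights}

# Server-side FedAvg step, for reference: G_{t+1} = G_t + (eta / n) * sum_i (L_i - G_t).
# Substituting the crafted update makes the attacker's term dominate, so that
# G_{t+1} is approximately X once the benign updates have mostly converged.
```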
Collaborative machine learning and related techniques such as federated learning allow multiple participants, each with his own training dataset, to build a joint model by training locally and periodically exchanging model updates. We demonstrate that these updates leak unintended information about participants' training data and develop passive and active inference attacks to exploit this leakage. First, we show that an adversarial participant can infer the presence of exact data points-for example, specific locations-in others' training data (i.e., membership inference). Then, we show how this adversary can infer properties that hold only for a subset of the training data and are independent of the properties that the joint model aims to capture. For example, he can infer when a specific person first appears in the photos used to train a binary gender classifier. We evaluate our attacks on a variety of tasks, datasets, and learning configurations, analyze their limitations, and discuss possible defenses.
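As an illustration of the passive property-inference idea, the following sketch trains a binary classifier on flattened model updates computed from auxiliary data with and without the target property; the feature construction and classifier are placeholder choices, not the authors' exact pipeline.

```python
# Sketch of passive property inference: the adversarial participant builds a
# training set of model-update "observations" derived from auxiliary data with
# and without the target property, then trains a binary classifier to label the
# aggregated updates it observes during federated training.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def flatten_update(update):
    """Concatenate all parameter deltas of one observed update into a feature vector."""
    return np.concatenate([np.asarray(p).ravel() for p in update])

def train_property_classifier(updates_with_prop, updates_without_prop):
    X = np.stack([flatten_update(u) for u in list(updates_with_prop) + list(updates_without_prop)])
    y = np.array([1] * len(updates_with_prop) + [0] * len(updates_without_prop))
    clf = RandomForestClassifier(n_estimators=200).fit(X, y)
    return clf  # clf.predict(flatten_update(observed_update)[None]) infers the property
```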
We quantitatively investigate how machine learning models leak information about the individual data records on which they were trained. We focus on the basic membership inference attack: given a data record and black-box access to a model, determine if the record was in the model's training dataset. To perform membership inference against a target model, we make adversarial use of machine learning and train our own inference model to recognize differences in the target model's predictions on the inputs that it trained on versus the inputs that it did not train on. We empirically evaluate our inference techniques on classification models trained by commercial "machine learning as a service" providers such as Google and Amazon. Using realistic datasets and classification tasks, including a hospital discharge dataset whose membership is sensitive from the privacy perspective, we show that these models can be vulnerable to membership inference attacks. We then investigate the factors that influence this leakage and evaluate mitigation strategies.
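A compact sketch of the shadow-model construction, assuming scikit-learn-style shadow and target models exposing predict_proba, and a single attack model rather than one per class as in the paper.

```python
# Sketch of the shadow-model attack (simplified): train shadow models on splits
# with known membership, record their prediction vectors on members and
# non-members, and fit a binary "in/out" attack classifier on those vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_attack_model(shadow_models, shadow_splits):
    """shadow_splits: list of (X_in, X_out) pairs; each shadow model was trained on its X_in."""
    feats, labels = [], []
    for model, (X_in, X_out) in zip(shadow_models, shadow_splits):
        feats.append(model.predict_proba(X_in));  labels.append(np.ones(len(X_in)))
        feats.append(model.predict_proba(X_out)); labels.append(np.zeros(len(X_out)))
    attack = LogisticRegression(max_iter=1000)
    attack.fit(np.vstack(feats), np.concatenate(labels))
    return attack

def is_member(attack, target_model, record):
    """Black-box decision made from the target model's prediction vector alone."""
    return attack.predict(target_model.predict_proba(record.reshape(1, -1)))[0] == 1
```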
In this paper, we formulate the problem of predicting a geolocation from free text as a sequence-to-sequence problem. Using this formulation, we obtain a geocoding model by training a T5 encoder-decoder transformer model using free text as an input and geolocation as an output. The geocoding model was trained on geo-tagged wikidump data with adaptive cell partitioning for the geolocation representation. All of the code including Rest-based application, dataset and model checkpoints used in this work are publicly available.
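Under the stated formulation, a minimal fine-tuning and inference sketch with HuggingFace Transformers might look as follows; the checkpoint name, the "geocode:" prefix, and the serialized output format are illustrative assumptions rather than the released configuration, which uses adaptive cell partitioning for the geolocation targets.

```python
# Sketch: treat geocoding as sequence-to-sequence with a T5 encoder-decoder.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# One training example: free text in, serialized geolocation out (here a plain
# "lat lon" string; the paper encodes adaptive cell ids instead).
inputs = tokenizer("geocode: Brandenburg Gate, Berlin", return_tensors="pt")
targets = tokenizer("52.5163 13.3777", return_tensors="pt")

outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=targets.input_ids)
outputs.loss.backward()  # plug into any standard fine-tuning loop

# Inference: generate the geolocation string for unseen text.
pred_ids = model.generate(**tokenizer("geocode: Eiffel Tower", return_tensors="pt"))
print(tokenizer.decode(pred_ids[0], skip_special_tokens=True))
```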
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
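For readers who want to try the released checkpoints, a short usage sketch with the Transformers library is shown below, using the small 560M-parameter BLOOM variant so it runs on modest hardware; the full 176B model exposes the same interface but needs multi-GPU or offloaded inference.

```python
# Sketch: loading an openly released BLOOM checkpoint for text generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"  # small variant; the 176B model uses the same API
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Translate to French: The model was trained on 46 natural languages."
ids = tokenizer(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```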
Aims. With the wealth of molecular emission data from (sub)millimeter observations and infrared spectra from the James Webb Space Telescope, fast forward models of the chemical composition of protoplanetary disks are of great value. Methods. We used a thermochemical modeling code to generate a diverse population of protoplanetary disk models. We trained a k-nearest-neighbors (KNN) regressor to instantly predict the chemistry of other disk models. Results. We show that, thanks to the correlations between the local physical conditions in the adopted protoplanetary disk models, the chemistry can be accurately reproduced using only a small subset of the physical conditions. We discuss the uncertainties and limitations of this method. Conclusions. The proposed method can be used for Bayesian fitting of line emission data to retrieve disk properties from observations. We present a pipeline for reproducing the same approach on other sets of disk chemical models.
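The surrogate step can be sketched with scikit-learn as follows; the feature set, file names, and hyperparameters are placeholders standing in for the precomputed thermochemical grid.

```python
# Sketch of the surrogate: fit a k-nearest-neighbours regressor that maps local
# disk physical conditions to chemical abundances precomputed with a
# thermochemical code, then predict the chemistry of new disk models instantly.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X: [n_cells, n_conditions], e.g. log density, gas/dust temperature, UV field, ...
# Y: [n_cells, n_species], log abundances from the thermochemical model grid
X = np.load("grid_conditions.npy")   # placeholder file names
Y = np.load("grid_abundances.npy")

surrogate = make_pipeline(StandardScaler(),
                          KNeighborsRegressor(n_neighbors=5, weights="distance"))
surrogate.fit(X, Y)

# Instant chemistry prediction for the cells of a new disk model:
X_new = np.load("new_disk_conditions.npy")
Y_pred = surrogate.predict(X_new)
```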
The cornerstone of neural algorithmic reasoning is the ability to solve algorithmic tasks, especially in a way that generalizes out of distribution. While recent years have seen a surge of methodological improvements in this area, they have mostly focused on building specialist models. Specialist models are capable of learning to execute only one algorithm, or a collection of algorithms sharing the same control-flow backbone. Here, instead, we focus on constructing a generalist neural algorithmic learner: a single graph neural network processor capable of learning to execute a wide range of algorithms, such as sorting, searching, dynamic programming, path-finding, and geometry. We leverage the CLRS benchmark to show empirically that, much like recent successes in the perception domain, generalist algorithmic learners can be built by "incorporating" knowledge. That is, algorithms can be learned effectively in a multi-task manner, so long as we can learn to execute them well in the single-task regime. Motivated by this, we present a series of improvements to the input representation, training regime, and processor architecture over CLRS, improving average single-task performance by over 20%. We then conduct a thorough ablation of multi-task learners leveraging these improvements. Our results demonstrate that a generalist learner effectively incorporates the knowledge captured by specialist models.
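The parameter-sharing pattern behind the generalist learner can be illustrated with a schematic PyTorch sketch on dummy data; the real CLRS models use a graph-network processor with structured hints, so this shows only the shape of one shared processor serving per-algorithm encoders and decoders.

```python
# Schematic of the generalist setup: per-algorithm encoders and decoders share
# one processor network, and training steps cycle through tasks so the processor
# "incorporates" knowledge from all of them.
import torch
import torch.nn as nn

TASKS = {"sorting": (8, 8), "searching": (6, 1), "shortest_paths": (10, 10)}  # (in_dim, out_dim)
HIDDEN = 64

processor = nn.GRUCell(HIDDEN, HIDDEN)                                  # shared across all tasks
encoders = nn.ModuleDict({t: nn.Linear(d_in, HIDDEN) for t, (d_in, _) in TASKS.items()})
decoders = nn.ModuleDict({t: nn.Linear(HIDDEN, d_out) for t, (_, d_out) in TASKS.items()})

params = list(processor.parameters()) + list(encoders.parameters()) + list(decoders.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

for step in range(300):
    task = list(TASKS)[step % len(TASKS)]                   # round-robin over algorithms
    d_in, d_out = TASKS[task]
    x, y = torch.randn(32, d_in), torch.randn(32, d_out)    # stand-in for CLRS samples
    h = torch.zeros(32, HIDDEN)
    for _ in range(4):                                      # a few processor "reasoning" steps
        h = processor(encoders[task](x), h)
    loss = nn.functional.mse_loss(decoders[task](h), y)
    opt.zero_grad(); loss.backward(); opt.step()
```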
The shuffle model of differential privacy has attracted significant interest as an intermediate trust model between the standard local and central models [EFMRTT19; CSUZZ19]. A key result in this model is that randomly shuffling locally randomized data amplifies differential privacy guarantees. Such amplification implies substantially stronger privacy guarantees for systems in which data is contributed anonymously [BEMMRLRKTS17]. In this work, we improve the state of the art on privacy amplification by shuffling, both theoretically and numerically. Our first contribution is the first asymptotically optimal analysis of the Rényi differential privacy parameters of the shuffled outputs of LDP randomizers. Our second contribution is a new analysis of privacy amplification by shuffling. This analysis improves upon the techniques of [FMT20] and leads to tighter numerical bounds in all parameter settings.
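A minimal simulation of the shuffle-model pipeline, assuming a k-ary randomized-response local randomizer, is shown below; it illustrates the mechanism whose shuffled output the amplification results apply to, and does not attempt to reproduce the paper's numerical bounds.

```python
# Sketch of the shuffle model: each user reports via a k-ary randomized response
# satisfying eps0-local DP, and the shuffler forwards only a random permutation
# (equivalently, the multiset) of the reports to the analyzer. Amplification
# results bound the central DP / Renyi DP parameters of this shuffled output as
# a function of eps0 and the number of users n.
import numpy as np

def k_randomized_response(value, k, eps0, rng):
    """eps0-LDP randomizer: keep the true value w.p. e^eps0/(e^eps0 + k - 1), else report another value uniformly."""
    p_true = np.exp(eps0) / (np.exp(eps0) + k - 1)
    if rng.random() < p_true:
        return value
    others = [v for v in range(k) if v != value]
    return rng.choice(others)

def shuffle_model_round(values, k, eps0, seed=0):
    rng = np.random.default_rng(seed)
    reports = np.array([k_randomized_response(v, k, eps0, rng) for v in values])
    rng.shuffle(reports)   # the shuffler: the analyzer only sees the permuted reports
    return reports

# Example: n = 10_000 users, k = 10 categories, eps0 = 1.0 local guarantee.
true_values = np.random.default_rng(1).integers(0, 10, 10_000)
reports = shuffle_model_round(true_values, k=10, eps0=1.0)
```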